Introduction

Predicting football match outcomes has long intrigued statisticians and sports analysts. The inherently random nature of sports events, coupled with various influencing factors such as team strength, player performance, and home advantage, makes this a challenging yet fascinating problem for statistical modeling.

In previous projects, I have explored goal-based models such as those proposed by Maher and by Dixon and Coles. Maher’s model describes the outcome of a match by modeling the goals of the two teams independently using Poisson distributions. Dixon and Coles enhanced this basic model by introducing correlation between the teams’ goal-scoring processes and accounting for the home advantage effect. While these models have been widely used due to their simplicity and effectiveness, they rely on the assumption that goals follow a Poisson distribution. This assumption, however, can be questionable in certain leagues, where overdispersion—the sample variance exceeding the sample mean—has been observed in the number of goals.

To address this limitation, Karlis and Ntzoufras proposed an alternative approach by focusing on the goal difference rather than the individual goal counts of each team. Their method utilizes the Skellam distribution, which models the difference between two independent Poisson-distributed random variables. This shift in focus eliminates the need to account for the correlation between teams’ scores directly and avoids the assumption of Poisson marginals. Although the Skellam-based model cannot predict exact match scores, it offers valuable insights into the likely goal difference, thereby simplifying the complexity associated with modeling individual goal counts. The application of Bayesian methods in this context allows for the incorporation of prior knowledge and continuous updating of model parameters as new data becomes available, enhancing the predictive power and adaptability of the model.
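As a concrete illustration, the Skellam pmf can be evaluated and cross-checked in R (a sketch using the skellam package, which is also employed later in this analysis; the rates below are purely illustrative):

```r
library(skellam)

# Distribution of X1 - X2, with X1 ~ Poisson(lambda1) and X2 ~ Poisson(lambda2)
lambda1 <- 1.6  # illustrative home scoring rate
lambda2 <- 1.1  # illustrative away scoring rate

k <- -3:3
pmf <- dskellam(k, lambda1, lambda2)

# Sanity check: the same probabilities via the Poisson convolution
conv_pmf <- sapply(k, function(kk) {
  j <- max(0, -kk):60  # truncate the infinite sum
  sum(dpois(kk + j, lambda1) * dpois(j, lambda2))
})
all.equal(pmf, conv_pmf)
```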


Project structure

This section outlines the structure of the project. The analysis focuses on the Serie A 2021-22 season, and all models are implemented and fitted with Stan. The repository is organized as follows:

📂 project/
│ 
├── 📂 stan/
│ 
├── 📂 data/ 
│   ├── 📄 ...
│   └── 📄 ...
...
library(ggplot2)
library(ggimage)
library(patchwork)
library(tidyverse)
library(dplyr,verbose = FALSE)
library(knitr)
library(kableExtra)
library(bayesplot)
source("../utils/my_skellam.R")
source("../utils/get_all_teams_data.R")
source("../utils/plot_parameters_ts.R")


N_CHAINS=4
N_ITERS=11000
N_WARMUP=1000
DATA_DIR= "../data/"
STAN_DIR= "../stan/"
SEASON="2122"
OFFLINE_MODELS_DIR= paste0("../estimated_models/season_",SEASON,"/offline_models/")
ONLINE_MODELS_DIR= paste0("../estimated_models/season_",SEASON,"/online_models/")

Exploratory Data Analysis

Before delving into model construction, it is crucial to gain a comprehensive understanding of the data. The exploratory data analysis (EDA) phase, through its visualizations, not only provides a glimpse into the underlying dynamics but also informs the modeling approach by highlighting factors that influence match outcomes. By analyzing various aspects of the data, such as goal differences, result distribution, and team performance, we lay the groundwork for developing the models.

SerieA_data<- read.csv(file= paste0(DATA_DIR,"season_",SEASON,"/SerieA_",SEASON,".csv"))
SerieA_data<- SerieA_data %>% select(c("HomeTeam","AwayTeam","FTHG","FTAG","FTR"))
SerieA_data<- SerieA_data %>% mutate(GD=FTHG-FTAG)
teams<- unique(SerieA_data$HomeTeam)
n_games<- nrow(SerieA_data)
n_teams<- length(teams)
n_matchdays= ceiling(n_games/((n_teams)/2))
# Numeric team indices for home and away sides
ht <- match(SerieA_data$HomeTeam, teams)
at <- match(SerieA_data$AwayTeam, teams)
SerieA_data$ht <- ht
SerieA_data$at <- at
teams<- str_replace_all(teams, " ", "")

The first variable I decided to inspect was, of course, the goal difference, as it represents our phenomenon of interest. The plot below shows the empirical data, overlays it with theoretical Skellam distributions (both the standard and the zero-inflated version), and highlights how well the theoretical distributions align with the observed data.

Next, we investigate the distribution of results (encoded as Home-win, Draw, and Away-win) across the entire season. As depicted in the plot below, Home-win emerges as the majority class. This observation informs our modeling approach, where we will introduce a home effect parameter to account for the slight advantage of playing matches in one’s own stadium.

Another interesting aspect to explore is the goal statistics by team. In particular, by examining goals scored and conceded by each team, we gain valuable insights into the likely ranking of the attack and defense coefficients in our upcoming model. Additionally, distinguishing between home and away matches underscores the significance of incorporating the home effect as a model parameter.

# Preliminary steps required to make plots
home_scored <- aggregate(SerieA_data$FTHG, list(SerieA_data$HomeTeam), FUN = sum) 
away_scored <- aggregate(SerieA_data$FTAG, list(SerieA_data$AwayTeam), FUN = sum)
home_conceeded <- aggregate(SerieA_data$FTAG, list(SerieA_data$HomeTeam), FUN = sum) 
away_conceeded <- aggregate(SerieA_data$FTHG, list(SerieA_data$AwayTeam), FUN = sum)

colnames(home_scored) <- c("team", "home_scored")
colnames(away_scored) <- c("team", "away_scored")
colnames(home_conceeded) <- c("team", "home_conceeded")
colnames(away_conceeded) <- c("team", "away_conceeded")

goals_scored <- merge(home_scored, away_scored, by = "team")
goals_scored$tot_goals <- goals_scored$home_scored + goals_scored$away_scored

goals_conceeded <- merge(home_conceeded, away_conceeded, by = "team")
goals_conceeded$tot_goals <- goals_conceeded$home_conceeded + goals_conceeded$away_conceeded

Comment: only five teams (out of twenty) scored more goals away than at home. Except for Napoli, which scored the same number of goals at home and away, all other teams had better offensive performances when playing in their own stadium. Symmetrically, only six teams conceded more goals at home than away; all others showed better defensive performances when playing in their own stadium (Inter, Salernitana, and Udinese actually conceded the same number of goals at home and away).

The existence of a home effect is also suggested by the number of useful results (wins and draws) achieved by each team, differentiated between home and away outcomes:

useful_results_df<- SerieA_data %>%
  group_by(HomeTeam) %>%
  summarize(UsefulResultsHome = sum(FTR %in% c("H", "D"))) %>%
  left_join(SerieA_data %>% 
              group_by(AwayTeam) %>%
              summarize(UsefulResultsAway = sum(FTR %in% c("D", "A"))),
            by = c("HomeTeam" = "AwayTeam")) %>%
  rename(Team=HomeTeam)

Finally, an interesting insight that can be extracted from our data lies in the Final Rankings of the league:

SerieA_data <- SerieA_data %>%
  mutate(HomePoints = ifelse(FTR == "H", 3,
                             ifelse(FTR == "D", 1, 0)),
         AwayPoints = ifelse(FTR == "A", 3,
                             ifelse(FTR == "D", 1, 0)))

total_points <- SerieA_data %>%
  select(HomeTeam,AwayTeam,FTHG,FTAG,HomePoints,AwayPoints) %>%
  group_by(Team = HomeTeam) %>%
  summarise(TotalPoints = sum(HomePoints),
            GS = sum(FTHG),
            GC = sum(FTAG)) %>%
  bind_rows(
    SerieA_data %>%
      group_by(Team = AwayTeam) %>%
      summarise(TotalPoints = sum(AwayPoints),
                GS = sum(FTAG),
                GC = sum(FTHG))
  ) %>%
  group_by(Team) %>%
  summarise(TotalPoints=sum(TotalPoints),
            GS = sum(GS),
            GC = sum(GC)) %>%
  arrange(desc(TotalPoints)) %>%
  mutate(Position = row_number(),
         GD =GS-GC) %>%
  select(Position, Team, TotalPoints,GS,GC,GD) 
Serie A 2021-2022 final rankings

Position  Team          TotalPoints   GS   GC    GD
       1  Milan                  86   69   31    38
       2  Inter                  84   84   32    52
       3  Napoli                 79   74   31    43
       4  Juventus               70   57   37    20
       5  Lazio                  64   77   58    19
       6  Roma                   63   59   43    16
       7  Fiorentina             62   59   51     8
       8  Atalanta               59   65   48    17
       9  Verona                 53   65   59     6
      10  Sassuolo               50   64   66    -2
      11  Torino                 50   46   41     5
      12  Udinese                47   61   58     3
      13  Bologna                46   44   55   -11
      14  Empoli                 41   50   70   -20
      15  Sampdoria              36   46   63   -17
      16  Spezia                 36   41   71   -30
      17  Salernitana            31   33   78   -45
      18  Cagliari               30   34   68   -34
      19  Genoa                  28   27   60   -33
      20  Venezia                27   34   69   -35

Model Formulation

As mentioned in the Introduction, we can use the Skellam distribution to model the goal difference of a football match. More precisely, the goal difference of the match between teams \(i\) and \(j\) (where we adopt the convention of \(i\) being the home team) is described by: \[\begin{equation} GD_{i,j} \sim Skellam\left(\theta^{(H)}_{(i,j)},\theta^{(A)}_{(i,j)}\right) \end{equation}\] with: \[\begin{align*} \theta^{(H)}_{(i,j)} &= \exp\{\mu + H_{eff} + att_{(i)} + def_{(j)}\}\\[0.2cm] \theta^{(A)}_{(i,j)} &= \exp\{\mu + att_{(j)} + def_{(i)}\} \end{align*}\]

Under the above parametrization, all parameters have a straightforward interpretation:

  • \(\mu\) is the baseline (log) scoring rate, common to all teams;
  • \(H_{eff}\) is the home effect, i.e. the extra scoring rate enjoyed by the home team;
  • \(att_{(i)}\) and \(def_{(i)}\) measure the attacking strength and the defensive weakness of team \(i\) (the larger \(def_{(i)}\), the more goals team \(i\) tends to concede);
  • to make the model identifiable, the attack and defense coefficients are constrained to sum to zero.

Some notes:

In my previous experience with goal-based models, I noticed that the number of draws (and the corresponding probability) tends to be underestimated. To address this, I explored the idea proposed by Karlis and Ntzoufras of using a zero-inflated version of the Skellam distribution. This approach seemed promising to me, and its probability distribution is defined as: \[\begin{align*} f_{ZIS}\left(x,\theta_1,\theta_2,p\right) = \begin{cases} (1-p)\cdot f_{S}\left(x,\theta_1,\theta_2\right)\hspace{0.5cm}&\text{if $x\neq0$}\\[0.3cm] p+(1-p)\cdot f_{S}\left(x,\theta_1,\theta_2\right)\hspace{0.5cm}&\text{if $x=0$} \end{cases} \end{align*}\] with \(x \in \mathbb{Z},\theta_1,\theta_2 >0\) and \(p \in [0,1]\)
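The zero-inflated density above translates directly into R (a sketch; dzis is a hypothetical helper built on skellam::dskellam):

```r
library(skellam)

# Zero-inflated Skellam pmf, following the definition above
dzis <- function(x, theta1, theta2, p) {
  base <- dskellam(x, theta1, theta2)
  ifelse(x == 0, p + (1 - p) * base, (1 - p) * base)
}

# The inflation parameter p moves extra probability mass onto x = 0 (a draw)
dskellam(0, 1.4, 1.1)    # standard Skellam draw probability
dzis(0, 1.4, 1.1, 0.05)  # strictly larger for any p > 0
```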

To fully specify a Bayesian model, we need to define prior distributions for the model parameters. In my analysis, I chose not to incorporate any external information from previous seasons. In such cases, Karlis and Ntzoufras suggested using normal prior distributions with a mean of zero and a large variance (e.g. \(10^4\)) to express prior ignorance. However, since the Stan documentation discourages non-informative priors, I set my priors to be weakly informative (still assuming them to be normal):

\[\begin{align*} \mu \sim N(0,10)\\[0.2cm] H_{eff} \sim N(0,10)\\[0.2cm] att_{(\bullet)} \sim N(0,10)\\[0.2cm] def_{(\bullet)} \sim N(0,10)\\[0.2cm] p \sim U(0,1) \end{align*}\]

In addition to the standard approach, I considered an online learning approach. In the Bayesian framework, we typically start with a prior distribution and update it with observed data to obtain a posterior distribution. When new data becomes available, this posterior can serve as the prior for the next update and so on. Despite concerns about specifying overly informative priors, I was curious to compare the online learning approach with the traditional “offline” model, where the “information” that updates our knowledge derives (almost) only from the data.

Note: in principle we don’t know the functional form of the posterior distribution. However, in my analysis I assumed my priors to be normal, updating only their location and scale matchday by matchday.

The formulation of the online model has the same likelihood as the offline one; the difference lies in the priors specification. More precisely, at a given matchday \(m\), the priors are the following:

\[\begin{align*} \mu &\sim N\big(MAP(\mu)^{(m-1)},\sigma_{\mu}^{(m-1)}\big)\\[0.2cm] H_{eff} &\sim N\big(MAP(H_{eff})^{(m-1)},\sigma_{H_{eff}}^{(m-1)}\big)\\[0.2cm] att_{(\bullet)} &\sim N\big(MAP(att_{(\bullet)}\;)^{(m-1)},\sigma_{att_{(\bullet)}}^{(m-1)}\big)\\[0.2cm] def_{(\bullet)} &\sim N\big(MAP(def_{(\bullet)}\;)^{(m-1)},\sigma_{def_{(\bullet)}}^{(m-1)}\big)\\[0.2cm] p &\sim U(0,1) \end{align*}\]
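A sketch of how these prior locations and scales could be extracted from the previous matchday's fit; map_estimate is a hypothetical helper that approximates the MAP with the mode of a kernel density estimate of the posterior draws:

```r
library(rstan)

# Approximate MAP: mode of a kernel density estimate of the draws
map_estimate <- function(draws) {
  d <- density(draws)
  d$x[which.max(d$y)]
}

# Assuming `fit` is the stanfit object from matchday m-1
draws <- rstan::extract(fit, pars = c("mu", "home_advantage"))
prev_mu_MAP             <- map_estimate(draws$mu)
prev_mu_sd              <- sd(draws$mu)
prev_home_advantage_MAP <- map_estimate(draws$home_advantage)
prev_home_advantage_sd  <- sd(draws$home_advantage)
```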


Model implementation in Stan

The models have been fully implemented using the Stan ecosystem and managed in R through the library rstan. I decided to follow the procedure illustrated by Karlis and Ntzoufras in their article, setting the following MCMC parameters: 4 chains, each run for 11,000 iterations with a warmup of 1,000 (yielding 40,000 post-warmup draws in total).

In what follows, the main blocks of the Stan files are illustrated.

Offline model

  • Functions: I had to define some custom functions, since Stan does not provide the log pmf of the Skellam distribution:

    functions {
      real skellam_lpmf(int k, real lambda1, real lambda2) {
        real total;
        real log_prob;
        total = (- lambda1 - lambda2) + (log(lambda1) - log(lambda2)) * k / 2;
        log_prob = total + log(modified_bessel_first_kind(k, 2 * sqrt(lambda1*lambda2)));
        return log_prob;
      }
      real zero_inflated_skellam_lpmf(int k, real lambda1, real lambda2, real p) {
        real base_prob;
        real prob;
        real log_prob;
        base_prob = exp(skellam_lpmf(k|lambda1, lambda2));
        if (k == 0){
          prob = p + (1 - p) * base_prob;
        }
        else{
          prob = (1 - p) * base_prob;
        }
        log_prob = log(prob);
        return log_prob;
      }
    }
  • Data:

    data {
      int<lower=1> n_teams;
      int<lower=1> n_games;
      array[n_games] int<lower=1, upper=n_teams> home_team;
      array[n_games] int<lower=1, upper=n_teams> away_team;
      array[n_games] int goal_difference;
    }
  • Parameters and transformed parameters

    parameters {
      real<lower=0, upper=1> p;
      real mu;
      real home_advantage;
      array[n_teams-1] real att_raw;
      array[n_teams-1] real def_raw;
    }
    
    transformed parameters {
      // Sum-to-zero constraint
      array[n_teams] real att;
      array[n_teams] real def;
    
      for (t in 1:(n_teams-1)) {
        att[t] = att_raw[t];
        def[t] = def_raw[t];
      }
    
      att[n_teams] = -sum(att_raw);
      def[n_teams] = -sum(def_raw);
    }
  • Model

    model {
      array[n_games] real theta_H;
      array[n_games] real theta_A;
      // Priors
      p ~ uniform(0, 1);
      att_raw ~ normal(0, 10);
      def_raw ~ normal(0, 10);
      home_advantage ~ normal(0, 10);
      mu ~ normal(0, 10);
      // Likelihood
      for (g in 1:n_games) {
        theta_H[g] = exp(mu + home_advantage +att[home_team[g]] + def[away_team[g]]);
        theta_A[g] = exp(mu + att[away_team[g]] + def[home_team[g]]);
        goal_difference[g] ~ zero_inflated_skellam(theta_H[g],theta_A[g],p);
      }
    }
  • Generated quantities: I couldn’t exploit the generated quantities block, since Stan does not provide a random number generator for the Skellam distribution (there is no skellam_rng()). Some users bypass this limitation by generating two values from a Poisson (with poisson_rng()) and taking their difference, but this would contrast with our setup, in which we don’t assume Poisson marginals. However, the generated quantities can be computed in R even after the model has been fitted! I opted for this latter approach, relying on the skellam library rather than reinventing the wheel.

Technical Note: the skellam library provides an rskellam() function, but the documentation specifies that it works internally by generating two random Poissons (with rpois()), which is exactly the mechanism I wanted to avoid :( Nevertheless, I could easily implement a custom Skellam generator through the inverse-sampling method, exploiting the qskellam() function from the library, as shown below:

library(skellam)
#-------------------------------------------------------------------------------
# 1) Random generator for standard skellam
my_rskellam<- function(n,lambda1,lambda2){
  q= runif(n,min=0,max=1)
  return(qskellam(q,lambda1,lambda2))
}
#-------------------------------------------------------------------------------
# 2) Random generator for zero-inflated skellam
my_rzeroinflatedskellam<- function(n,lambda1,lambda2,p){
  samples<- vector(mode="numeric",length = n)
  for(i in 1:n){
    # Generate 0 with probability p
    # Or a random number from a standard skellam with prob (1-p)
    if(runif(1)>p){
      samples[i]<- my_rskellam(1,lambda1,lambda2)
    }
    #else not needed, since there are already zeros!
  }
  return(samples)
}
#-------------------------------------------------------------------------------

Online model

The Stan code for the online model is similar to the previous one, with the main differences in the data block and in the priors definition (model block):

data {
  int<lower=1> n_teams;
  int<lower=1> n_games;
  array[n_games] int<lower=1, upper=n_teams> home_team;
  array[n_games] int<lower=1, upper=n_teams> away_team;
  array[n_games] int goal_difference;
  // Previous estimates and sd
  array[n_teams-1] real prev_att_MAP;
  array[n_teams-1] real prev_def_MAP;
  real prev_mu_MAP;
  real prev_home_advantage_MAP;
  array[n_teams-1] real<lower=0> prev_att_sd;
  array[n_teams-1] real<lower=0> prev_def_sd;
  real<lower=0> prev_mu_sd;
  real<lower=0> prev_home_advantage_sd;
}

model {
  ...
  // Priors
  p ~ uniform(0, 1);
  for(a in 1:(n_teams-1)){
    att_raw[a] ~ normal(prev_att_MAP[a], prev_att_sd[a]);
    def_raw[a] ~ normal(prev_def_MAP[a], prev_def_sd[a]);
  }
  home_advantage ~ normal(prev_home_advantage_MAP, prev_home_advantage_sd);
  mu ~ normal(prev_mu_MAP, prev_mu_sd);
  ...
}

Important notes

  • Due to the extensive time required to fit the models, I am not including the code here (it is not particularly interesting, though). However, I have fitted the models and stored them in a dedicated folder, and they are available in the repository.
  • The Bayesian procedure of fitting the model matchday by matchday is, in practice, carried out on a weekly basis. With access to the complete dataset for the analyzed season, I was able to fit the models in a for loop, iterating over each matchday. The procedure for each iteration is as follows:
    1. Select the rows of the dataframe corresponding to the available information up to the current matchday.
    2. Load the model estimated at the end of the previous matchday from the models folder.
    3. Fit the model and store the updated model in the dedicated folder.
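The iteration above can be sketched as follows (a sketch, not the actual fitting script: the Stan file name and the exact list entries are assumptions based on the constants and data blocks shown earlier):

```r
library(rstan)

for (m in 2:n_matchdays) {
  # 1. Information available up to the current matchday
  current_data <- SerieA_data[1:(m * n_teams / 2), ]

  # 2. Previous matchday's model, whose MAPs and sds define the new priors
  load(paste0(ONLINE_MODELS_DIR, "matchday", m - 1, "/KN_matchday", m - 1, ".rds"))

  stan_data <- list(n_teams = n_teams,
                    n_games = nrow(current_data),
                    home_team = current_data$ht,
                    away_team = current_data$at,
                    goal_difference = current_data$GD)
  # ... plus the prev_*_MAP / prev_*_sd entries extracted from the loaded model

  # 3. Fit and store the updated model
  KN_model <- stan(file = paste0(STAN_DIR, "online_model.stan"), data = stan_data,
                   chains = N_CHAINS, iter = N_ITERS, warmup = N_WARMUP)
  save(KN_model, file = paste0(ONLINE_MODELS_DIR, "matchday", m, "/KN_matchday", m, ".rds"))
}
```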

Posterior distributions of the coefficients

After fitting my Bayesian models, I used the bayesplot package to generate informative plots of the posterior distributions of the model parameters. To provide a comprehensive overview, I will present the plots for the “final model”, which is the one fitted at the end of the season, with all the available data. These plots will be shown for both the offline and online models, with a clear distinction between the attack parameters and the defense parameters.

Plots for offline models

load(paste0(OFFLINE_MODELS_DIR,"matchday38/KN_matchday38.rds"))
posterior=as.array(KN_model)
#-------------------------------------------------------------------------------
# Lazy code to retrieve only the parameters of interest
par_names<-  names(KN_model)
useful_par_names<- par_names[!(grepl("raw", par_names))]
att_params <- useful_par_names[grepl("att", useful_par_names)]
def_params <- useful_par_names[grepl("def", useful_par_names)]
att_labels <- setNames(teams, paste0("att[", 1:20, "]"))
def_labels <- setNames(teams, paste0("def[", 1:20, "]"))
#-------------------------------------------------------------------------------
#-------------------------------------------------------------------------------
# ATTACK COEFFICIENTS
color_scheme_set("blue")
att_intervals_offline<- mcmc_intervals(posterior,pars=att_params)+
  scale_y_discrete(labels=att_labels)+
  ggtitle("MCMC intervals for attack coefficients (offline)")

att_areas_offline<- mcmc_areas(posterior,pars= att_params)+
  scale_y_discrete(labels=att_labels)+
  ggtitle("MCMC areas for attack coefficients  (offline)")

att_dens_overlay_offline<- mcmc_dens_overlay(posterior,pars=att_params)+
  ggtitle("Marginal posteriors for attack coefficients by chain  (offline)")
att_dens_overlay_offline[[1]]$Parameter = rep(att_labels,
                                             each=N_CHAINS*(N_ITERS-N_WARMUP))
#-----------------------------------------------------------------------------
# DEFENSE COEFFICIENTS
color_scheme_set("red")
def_intervals_offline<- mcmc_intervals(posterior,pars=def_params)+
  scale_y_discrete(labels=def_labels)+
  ggtitle("MCMC intervals for defense coefficients (offline)")

def_areas_offline<- mcmc_areas(posterior,pars= def_params)+
  scale_y_discrete(labels=def_labels)+
  ggtitle("MCMC areas for defense coefficients (offline)")
 
def_dens_overlay_offline<- mcmc_dens_overlay(posterior,pars=def_params)+
  ggtitle("Marginal posteriors for defense coefficients by chain (offline)")
def_dens_overlay_offline[[1]]$Parameter = rep(def_labels,
                                             each=N_CHAINS*(N_ITERS-N_WARMUP))
#-------------------------------------------------------------------------------
# HOME ADVANTAGE
home_areas_offline<- mcmc_areas(posterior,pars="home_advantage")+
  ggtitle("MCMC areas for home advantage (offline)")
home_dens_overlay_offline<- mcmc_dens_overlay(posterior,pars="home_advantage")+
  ggtitle("Marginal posteriors for home advantage by chain (offline)")

Plots for online models

load(paste0(ONLINE_MODELS_DIR,"matchday38/KN_matchday38.rds"))
posterior=as.array(KN_model)
#-------------------------------------------------------------------------------
# Lazy code to retrieve only the parameters of interest
par_names<-  names(KN_model)
useful_par_names<- par_names[!(grepl("raw", par_names))]
att_params <- useful_par_names[grepl("att", useful_par_names)]
def_params <- useful_par_names[grepl("def", useful_par_names)]
att_labels <- setNames(teams, paste0("att[", 1:20, "]"))
def_labels <- setNames(teams, paste0("def[", 1:20, "]"))
#-------------------------------------------------------------------------------
#-------------------------------------------------------------------------------
# ATTACK COEFFICIENTS
color_scheme_set("blue")
att_intervals_online<- mcmc_intervals(posterior,pars=att_params)+
  scale_y_discrete(labels=att_labels)+
  ggtitle("MCMC intervals for attack coefficients (online)")

att_areas_online<- mcmc_areas(posterior,pars= att_params)+
  scale_y_discrete(labels=att_labels)+
  ggtitle("MCMC areas for attack coefficient (online)")

att_dens_overlay_online<- mcmc_dens_overlay(posterior,pars=att_params)+
  ggtitle("Marginal posteriors for attack coefficients by chain (online)")
att_dens_overlay_online[[1]]$Parameter = rep(att_labels,
                                             each=N_CHAINS*(N_ITERS-N_WARMUP))
#-------------------------------------------------------------------------------
# DEFENSE COEFFICIENTS
color_scheme_set("red")
def_intervals_online<- mcmc_intervals(posterior,pars=def_params)+
   scale_y_discrete(labels=def_labels)+
   ggtitle("MCMC intervals for defense coefficients (online)")

def_areas_online<- mcmc_areas(posterior,pars= def_params)+
  scale_y_discrete(labels=def_labels)+
  ggtitle("MCMC areas for defense coefficients (online)")
def_dens_overlay_online<- mcmc_dens_overlay(posterior,pars=def_params)+
  ggtitle("Marginal posteriors for defense coefficients by chain (online)")
def_dens_overlay_online[[1]]$Parameter = rep(def_labels,
                                             each=N_CHAINS*(N_ITERS-N_WARMUP))
#-------------------------------------------------------------------------------
# HOME ADVANTAGE
home_areas_online<- mcmc_areas(posterior,pars="home_advantage")+
  ggtitle("MCMC areas for home advantage (online)")
home_dens_overlay_online<- mcmc_dens_overlay(posterior,pars="home_advantage")+
  ggtitle("Marginal posteriors for home advantage by chain (online)")

MCMC Intervals plots

MCMC Areas plots

Marginal posteriors (by chain)

Home advantage

Teams comparison

Abilities over time

Teams’ abilities over time (offline model)

Teams’ abilities over time (online model)


A glimpse at convergence diagnostics

The training phase of the models seemed promising, as the MCMC procedure completed without any warnings or errors. To assess the convergence of the chains, we can inspect the summary of the models and look at the Gelman-Rubin statistic \(\hat{R}\) (Rhat), which provides a quantitative measure of the chains’ convergence to the stationary distribution.

Offline model

## Inference for Stan model: anon_model.
## 4 chains, each with iter=11000; warmup=1000; thin=1; 
## post-warmup draws per chain=10000, total post-warmup draws=40000.
## 
##                 mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## mu             -0.19       0 0.11 -0.40 -0.26 -0.19 -0.11  0.02 27901    1
## home_advantage  0.12       0 0.07 -0.01  0.08  0.12  0.17  0.26 47969    1
## att[1]          0.68       0 0.17  0.33  0.56  0.68  0.80  1.02 36874    1
## att[2]          0.06       0 0.23 -0.40 -0.09  0.07  0.22  0.51 40797    1
## att[3]         -0.25       0 0.29 -0.84 -0.44 -0.24 -0.05  0.29 40810    1
## att[4]         -0.10       0 0.24 -0.58 -0.25 -0.09  0.07  0.35 38297    1
## att[5]         -0.05       0 0.26 -0.58 -0.22 -0.04  0.13  0.44 42691    1
## att[6]          0.49       0 0.22  0.04  0.34  0.49  0.64  0.91 35893    1
## att[7]          0.50       0 0.18  0.15  0.39  0.51  0.62  0.85 37597    1
## att[8]          0.28       0 0.21 -0.14  0.14  0.29  0.43  0.69 40854    1
## att[9]         -0.73       0 0.40 -1.57 -0.98 -0.71 -0.46 -0.02 39762    1
## att[10]         0.29       0 0.24 -0.19  0.13  0.29  0.45  0.74 38666    1
## att[11]         0.30       0 0.21 -0.12  0.16  0.30  0.44  0.71 44720    1
## att[12]         0.67       0 0.19  0.29  0.54  0.67  0.80  1.04 35146    1
## att[13]         0.53       0 0.21  0.12  0.39  0.53  0.67  0.93 37462    1
## att[14]        -0.04       0 0.21 -0.47 -0.18 -0.04  0.10  0.36 42073    1
## att[15]        -1.28       0 0.52 -2.41 -1.61 -1.24 -0.92 -0.35 35030    1
## att[16]         0.29       0 0.23 -0.18  0.14  0.30  0.45  0.73 35889    1
## att[17]         0.29       0 0.19 -0.08  0.16  0.29  0.42  0.65 42743    1
## att[18]        -0.51       0 0.39 -1.33 -0.77 -0.50 -0.24  0.20 37348    1
## att[19]        -0.39       0 0.32 -1.05 -0.60 -0.38 -0.17  0.20 40775    1
## att[20]        -1.03       0 0.43 -1.93 -1.30 -1.01 -0.74 -0.26 37245    1
## def[1]         -0.54       0 0.41 -1.42 -0.80 -0.51 -0.25  0.20 35246    1
## def[2]         -0.13       0 0.26 -0.66 -0.30 -0.13  0.05  0.35 39088    1
## def[3]          0.22       0 0.20 -0.18  0.09  0.22  0.36  0.61 41929    1
## def[4]         -0.28       0 0.27 -0.82 -0.46 -0.28 -0.10  0.22 39696    1
## def[5]          0.18       0 0.22 -0.25  0.04  0.18  0.33  0.60 39619    1
## def[6]          0.41       0 0.23 -0.05  0.25  0.41  0.56  0.85 36563    1
## def[7]         -0.65       0 0.39 -1.48 -0.90 -0.63 -0.38  0.06 37040    1
## def[8]         -0.13       0 0.27 -0.69 -0.31 -0.12  0.06  0.39 40010    1
## def[9]          0.28       0 0.19 -0.09  0.16  0.29  0.41  0.66 39832    1
## def[10]         0.54       0 0.19  0.16  0.42  0.54  0.67  0.91 37093    1
## def[11]        -0.14       0 0.28 -0.71 -0.32 -0.13  0.05  0.39 44269    1
## def[12]         0.34       0 0.24 -0.14  0.19  0.35  0.51  0.80 36020    1
## def[13]         0.37       0 0.23 -0.08  0.22  0.37  0.53  0.80 39245    1
## def[14]        -0.85       0 0.35 -1.58 -1.07 -0.83 -0.61 -0.20 43395    1
## def[15]         0.11       0 0.20 -0.27 -0.02  0.11  0.25  0.50 45973    1
## def[16]         0.29       0 0.23 -0.16  0.14  0.29  0.44  0.72 35305    1
## def[17]        -1.11       0 0.46 -2.10 -1.40 -1.08 -0.79 -0.28 35375    1
## def[18]         0.54       0 0.18  0.19  0.42  0.54  0.66  0.89 37538    1
## def[19]         0.33       0 0.19 -0.04  0.21  0.34  0.46  0.70 39414    1
## def[20]         0.20       0 0.19 -0.17  0.07  0.20  0.32  0.56 37993    1
## 
## Samples were drawn using NUTS(diag_e) at Thu May 23 16:24:21 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Online model

## Inference for Stan model: anon_model.
## 4 chains, each with iter=11000; warmup=1000; thin=1; 
## post-warmup draws per chain=10000, total post-warmup draws=40000.
## 
##                 mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## mu             -0.32       0 0.02 -0.35 -0.33 -0.32 -0.31 -0.29 57363    1
## home_advantage  0.15       0 0.02  0.12  0.14  0.15  0.16  0.18 55709    1
## att[1]          0.78       0 0.03  0.71  0.75  0.78  0.80  0.84 54325    1
## att[2]          0.11       0 0.05  0.02  0.08  0.11  0.15  0.21 57006    1
## att[3]          0.01       0 0.05 -0.09 -0.02  0.01  0.05  0.12 54203    1
## att[4]          0.03       0 0.05 -0.06  0.00  0.03  0.06  0.12 54559    1
## att[5]          0.21       0 0.05  0.11  0.17  0.21  0.24  0.30 51632    1
## att[6]          0.37       0 0.05  0.27  0.33  0.37  0.40  0.47 50910    1
## att[7]          0.49       0 0.04  0.42  0.46  0.49  0.51  0.56 57498    1
## att[8]          0.44       0 0.04  0.36  0.41  0.44  0.47  0.52 53710    1
## att[9]         -0.64       0 0.08 -0.79 -0.69 -0.64 -0.59 -0.49 51317    1
## att[10]         0.26       0 0.05  0.16  0.22  0.26  0.29  0.36 58589    1
## att[11]         0.40       0 0.04  0.32  0.37  0.40  0.43  0.48 58019    1
## att[12]         0.79       0 0.04  0.72  0.76  0.79  0.81  0.86 52849    1
## att[13]         0.57       0 0.04  0.49  0.54  0.57  0.60  0.65 57625    1
## att[14]         0.11       0 0.04  0.02  0.08  0.11  0.14  0.20 54968    1
## att[15]        -2.18       0 0.12 -2.41 -2.26 -2.18 -2.10 -1.95 50041    1
## att[16]         0.01       0 0.05 -0.09 -0.02  0.01  0.05  0.12 53968    1
## att[17]         0.40       0 0.04  0.32  0.37  0.40  0.43  0.48 49641    1
## att[18]        -0.69       0 0.09 -0.87 -0.75 -0.69 -0.63 -0.52 51054    1
## att[19]        -0.42       0 0.07 -0.56 -0.47 -0.42 -0.38 -0.29 51023    1
## att[20]        -1.03       0 0.22 -1.47 -1.18 -1.03 -0.88 -0.61 62394    1
## def[1]         -0.62       0 0.07 -0.75 -0.66 -0.62 -0.57 -0.49 56236    1
## def[2]         -0.16       0 0.05 -0.26 -0.20 -0.17 -0.13 -0.07 48733    1
## def[3]          0.41       0 0.04  0.33  0.38  0.41  0.43  0.48 51272    1
## def[4]         -0.25       0 0.05 -0.35 -0.28 -0.25 -0.21 -0.15 53831    1
## def[5]          0.41       0 0.04  0.33  0.38  0.41  0.43  0.48 53237    1
## def[6]          0.34       0 0.04  0.25  0.31  0.34  0.37  0.42 53148    1
## def[7]         -0.83       0 0.06 -0.95 -0.87 -0.83 -0.79 -0.71 50939    1
## def[8]          0.01       0 0.05 -0.08 -0.02  0.01  0.04  0.10 52022    1
## def[9]          0.32       0 0.04  0.25  0.30  0.32  0.35  0.39 56688    1
## def[10]         0.53       0 0.04  0.45  0.50  0.53  0.55  0.60 50807    1
## def[11]        -0.31       0 0.06 -0.42 -0.35 -0.31 -0.28 -0.21 54624    1
## def[12]         0.43       0 0.04  0.34  0.40  0.43  0.45  0.51 52231    1
## def[13]         0.30       0 0.04  0.22  0.27  0.30  0.33  0.39 57413    1
## def[14]        -0.98       0 0.06 -1.11 -1.02 -0.98 -0.94 -0.86 53674    1
## def[15]         0.06       0 0.04 -0.01  0.04  0.06  0.09  0.14 49967    1
## def[16]        -0.10       0 0.05 -0.19 -0.13 -0.10 -0.07  0.00 53460    1
## def[17]        -0.78       0 0.06 -0.90 -0.82 -0.78 -0.74 -0.66 50585    1
## def[18]         0.62       0 0.03  0.56  0.60  0.62  0.65  0.69 59261    1
## def[19]         0.36       0 0.04  0.29  0.33  0.36  0.38  0.43 53456    1
## def[20]         0.25       0 0.13 -0.01  0.16  0.25  0.34  0.50 64327    1
## 
## Samples were drawn using NUTS(diag_e) at Thu May 23 15:21:00 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

In the summary, all Rhat values are equal to 1, indicating convergence of the chains (remember the rule of thumb, \(\hat{R}< 1.1\)). Additionally, the effective sample size (n_eff) values are high, which is a positive indicator. Notably, the n_eff values for the online model are generally higher than those for the offline model, which I had somewhat expected. Interestingly, many n_eff values (all of them, actually, for the online model) are greater than the total number of post-warmup draws. At first, this seemed paradoxical to me. However, after reading various discussions on the Stan developers forum, I found the following answer: “Hi! It means that the draws that Stan is producing are better than independent draws for those parameters. Sounds like a joke, but apparently Stan is actually that awesome. :)” (I leave the link to the discussion here).
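These checks can also be done programmatically rather than by reading the printed summary (a sketch, assuming KN_model is the loaded stanfit object and the constants from the setup chunk are in scope):

```r
library(rstan)

s <- summary(KN_model)$summary  # matrix with one row per parameter

all(s[, "Rhat"] < 1.1, na.rm = TRUE)  # rule-of-thumb convergence check

# Share of parameters whose n_eff exceeds the total post-warmup draws
mean(s[, "n_eff"] > N_CHAINS * (N_ITERS - N_WARMUP), na.rm = TRUE)
```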

Thanks to the bayesplot package, we can also plot the traceplots to visually inspect for divergent transitions. In our case, no transitions diverged, so the plots are not particularly interesting. However, had there been any divergences, they would be clearly highlighted in these plots. As an example, I present the traceplots for the attack and defense coefficients from the online model:

## No divergences to plot.
## No divergences to plot.
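The plotting code is not shown above; a sketch of how such traceplots can be produced with bayesplot follows. Here `fit` stands for the fitted stanfit object (e.g. `KN_model` loaded from the online models directory); supplying `np = nuts_params(fit)` is what overlays divergent transitions and triggers the "No divergences to plot." message when there are none.

```r
# Traceplots with divergent transitions highlighted (a sketch; `fit`
# stands for the stanfit object, e.g. KN_model)
plot_traces <- function(fit, pars_regex = "^att\\[") {
  posterior <- as.array(fit)
  # np = nuts_params(...) overlays divergences; bayesplot prints
  # "No divergences to plot." when there are none
  bayesplot::mcmc_trace(posterior, regex_pars = pars_regex,
                        np = rstan::nuts_params(fit))
}
```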

After examining the Gelman-Rubin statistic and the effective sample size, another important convergence diagnostic is the autocorrelation function (ACF) plot, which lets us inspect the autocorrelations within the chain samples. Upon reviewing all the ACF plots, I did not observe any particularly concerning patterns. To provide a comprehensive overview, I will present ACF plots for just two parameters: one with greater variability (the defense coefficient for Venezia FC) and one with less variability (the attack coefficient for Inter). These examples illustrate that, regardless of the variability, there are no significant autocorrelation patterns, indicating good convergence properties of the model.
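These ACF plots can be produced with bayesplot as well; a minimal sketch is below. The helper name `plot_acf` is my own, `fit` stands for the stanfit object, and the exact parameter names (e.g. which `att[...]` index corresponds to Inter) depend on the team ordering in the data, so they are left as arguments.

```r
# ACF plots with bayesplot (a sketch; `plot_acf` is a hypothetical helper,
# `fit` stands for the stanfit object, and `pars` for parameter names such
# as "att[...]" or "def[...]" -- indices depend on the team ordering)
plot_acf <- function(fit, pars, lags = 20) {
  bayesplot::mcmc_acf(as.array(fit), pars = pars, lags = lags)
}
```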


Model checking and comparison

So far we have constructed a Bayesian model, computed the posterior distribution of all estimates, and conducted diagnostic tests —both quantitative and qualitative— regarding the convergence of the chains. Now, our attention turns towards evaluating the fit of the model to the data and our substantive knowledge.

Posterior predictive checks (in-sample replications)

As illustrated in the Bayesian Data Analysis book, a key technique for assessing model fit involves drawing simulated values from the posterior predictive distribution and comparing them to the observed data. More specifically, we will now perform an in-sample replication analysis. If the model fits well, replicated data generated under it should closely resemble the observed data, while systematic discrepancies between replicated and observed data may indicate model misfit (or chance variation). In summary, the observed data should look plausible under the posterior predictive distribution.

The analysis of model checking can be effectively conducted in a graphical/qualitative manner. Fortunately, our friend bayesplot offers a suite of graphical solutions to assess the posterior predictive distribution for our in-sample analysis. Before delving into the graphical model checking plots, a preliminary step is required. Due to limitations in defining the generated quantities block in Stan, as discussed earlier, it was necessary to simulate these quantities in R.

Note: In conducting the in-sample replication procedure, we assess the plausibility of observed data relative to replications drawn from the model trained with such data. While it is possible to perform this check match day by match day, I opted to utilize our more comprehensive model—the one estimated after the final match of the season. By employing this approach, the observed results encompass the entirety of the season’s data, providing a robust assessment of model fit across the entire dataset.
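The replication code below relies on the helper `my_rzeroinflatedskellam`, defined earlier in the project. For reference, a minimal sketch consistent with a zero-inflated Skellam could look like the following; the project's actual helper may differ in details.

```r
# Minimal zero-inflated Skellam sampler (a sketch: the project's own
# my_rzeroinflatedskellam may differ in details).
my_rzeroinflatedskellam <- function(n, theta_H, theta_A, p) {
  # with probability p, an "extra" zero; otherwise the difference of
  # two independent Poisson counts (home goals minus away goals)
  ifelse(runif(n) < p,
         0L,
         rpois(n, theta_H) - rpois(n, theta_A))
}
```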

PPC for offline models

#-------------------------------------------------------------------------------
# Data frame in which to store the simulated GD
GD_df<- data.frame(matrix(NA,nrow = N_CHAINS*(N_ITERS-N_WARMUP),ncol=n_games))
# Load the final model (end of season)
load(paste0(OFFLINE_MODELS_DIR,"matchday",n_matchdays,
            "/KN_matchday",n_matchdays,".rds"))
#-------------------------------------------------------------------------------
# Retrieve some "global" parameters
posterior<- as.array(KN_model)
mu = posterior[,,"mu"]
home=posterior[,,"home_advantage"]
p = posterior[,,"p"]
#-------------------------------------------------------------------------------
for (m in 1:n_games){
  cat(m,"\n")
  ht=SerieA_data$ht[m]
  at=SerieA_data$at[m]
  attH=posterior[,,paste0("att[",ht,"]")]
  defH=posterior[,,paste0("def[",ht,"]")]
  attA=posterior[,,paste0("att[",at,"]")]
  defA=posterior[,,paste0("def[",at,"]")]
  
  theta_H = exp(mu+home+attH+defA)
  theta_A = exp(mu+attA+defH)
  
  GD<- mapply(my_rzeroinflatedskellam, 1,theta_H, theta_A,p)
  
  GD_df[,m]<- GD
  colnames(GD_df)[m]<-paste0(teams[ht],"-vs-",teams[at])
}
#-------------------------------------------------------------------------------
color_scheme_set("teal")

ppc_dens_overlay_offline <- ppc_dens_overlay(y=SerieA_data$GD,
                                             yrep=as.matrix(GD_df[1:4000,]))

ppc_ecdf_overlay_offline <- ppc_ecdf_overlay(y=SerieA_data$GD,
                                             yrep=as.matrix(GD_df[1:10000,]))

ppc_stat_mean_offline<- ppc_stat(y = SerieA_data$GD,
                                 yrep=as.matrix(GD_df[1:40000,]),
                                 stat = "mean",bins = 20)

propzero <- function(x) mean(x == 0)
ppc_stat_propzero_offline <- ppc_stat(y = SerieA_data$GD,
                                      yrep=as.matrix(GD_df[1:40000,]),
                                      stat = "propzero",bins = 20)

ppc_stat_2d_offline <- ppc_stat_2d(y = SerieA_data$GD,
                                   yrep=as.matrix(GD_df[1:40000,]),
                                   stat = c("mean","sd"))

PPC for online models

#-------------------------------------------------------------------------------
# Data frame in which to store the simulated GD
GD_df<- data.frame(matrix(NA,nrow = N_CHAINS*(N_ITERS-N_WARMUP),ncol=n_games))
# Load the final model (end of season)
load(paste0(ONLINE_MODELS_DIR,"matchday",n_matchdays,
            "/KN_matchday",n_matchdays,".rds"))
#-------------------------------------------------------------------------------
# Retrieve some "global" parameters
posterior<- as.array(KN_model)
mu = posterior[,,"mu"]
home=posterior[,,"home_advantage"]
p = posterior[,,"p"]
#-------------------------------------------------------------------------------
for (m in 1:n_games){
  
  ht=SerieA_data$ht[m]
  at=SerieA_data$at[m]
  attH=posterior[,,paste0("att[",ht,"]")]
  defH=posterior[,,paste0("def[",ht,"]")]
  attA=posterior[,,paste0("att[",at,"]")]
  defA=posterior[,,paste0("def[",at,"]")]
  
  theta_H = exp(mu+home+attH+defA)
  theta_A = exp(mu+attA+defH)
  
  GD<- mapply(my_rzeroinflatedskellam, 1,theta_H, theta_A,p)
  
  GD_df[,m]<- GD
  colnames(GD_df)[m]<-paste0(teams[ht],"-vs-",teams[at])
}
#-------------------------------------------------------------------------------
color_scheme_set("purple")

ppc_dens_overlay_online <- ppc_dens_overlay(y=SerieA_data$GD,
                                            yrep=as.matrix(GD_df[1:4000,]))

ppc_ecdf_overlay_online<- ppc_ecdf_overlay(y=SerieA_data$GD,
                                           yrep=as.matrix(GD_df[1:10000,]))

ppc_stat_mean_online<- ppc_stat(y = SerieA_data$GD,
                                yrep=as.matrix(GD_df[1:40000,]),
                                stat = "mean",bins = 20)

propzero <- function(x) mean(x == 0)
ppc_stat_propzero_online <- ppc_stat(y = SerieA_data$GD,
                                     yrep=as.matrix(GD_df[1:40000,]),
                                     stat = "propzero",bins = 20)

ppc_stat_2d_online <- ppc_stat_2d(y = SerieA_data$GD,
                                  yrep=as.matrix(GD_df[1:40000,]),
                                  stat = c("mean","sd"))


Posterior predictive checks (out-sample replications)

Here I did something different from what is typically shown in textbooks: using the model fitted up to the previous matchday, I can make genuinely out-of-sample predictions for each matchday. I used these predictions to simulate the final league rankings and to compute confusion matrices (and, in fact, Brier scores as well).
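For reference, the Brier score for the three-way match outcome (home win / draw / away win) can be sketched as below. The names `brier_score`, `probs`, and `outcome` are illustrative, not the project's actual code.

```r
# Multiclass Brier score for match outcomes (a sketch with hypothetical
# names). `probs` is an n x 3 matrix of predicted probabilities for
# c("H", "D", "A"); `outcome` is the vector of observed outcomes.
brier_score <- function(probs, outcome, classes = c("H", "D", "A")) {
  obs <- t(sapply(outcome, function(o) as.numeric(classes == o)))  # one-hot
  mean(rowSums((probs - obs)^2))  # 0 = perfect, 2 = worst possible
}
```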

Conclusions and further developments